FedDNA: Federated learning using dynamic node alignment

Federated Learning (FL), as a new computing framework, has received significant attention recently due to its advantage in preserving data privacy while training models with strong performance. During FL, distributed sites first learn their respective parameters. A central site then consolidates the learned parameters, using averaging or other approaches, and disseminates the new weights across all sites to carry out the next round of learning. The distributed parameter learning and consolidation repeat in an iterative fashion until the algorithm converges or terminates. Many FL methods exist to aggregate weights from distributed sites, but most use a static node alignment approach, in which nodes of the distributed networks are assigned in advance to matched positions and their weights are aggregated by position. In reality, neural networks, especially dense networks, have nontransparent roles with respect to individual nodes. Combined with the random nature of network training, static node matching often does not produce the best matching between nodes across sites. In this paper, we propose FedDNA, a dynamic node alignment federated learning algorithm. Our theme is to find the best matching nodes between different sites, and then aggregate the weights of the matched nodes for federated learning. For each node in a neural network, we represent its weight values as a vector and use a distance function to find the most similar nodes, i.e., nodes with the smallest distance to nodes from other sites. Because finding the best matching across all sites is computationally expensive, we further design a minimum spanning tree based approach to ensure that a node from each site is matched with peers from the other sites such that the total pairwise distance across all sites is minimized. Experiments and comparisons demonstrate that FedDNA outperforms commonly used baselines, such as FedAvg, for federated learning.


Introduction
Federated Learning (FL), originally proposed in 2016 [1], is a learning paradigm which builds machine learning models from datasets distributed across multiple sites/devices in order to protect privacy and prevent data leakage. While traditional machine learning methods are typically trained on centralized data, FL provides a feasible way to develop models that keep all training data on distributed devices and update model parameters through parameter aggregation. As data collection and analytics become increasingly popular, protecting data privacy and safety has become a major concern for business, government, and nearly all sectors of society. By deploying FL, each participant in the model training process can jointly build one model without sharing data, which naturally results in data privacy protection. Traditional machine learning methods need to concentrate training data in a single machine or data center, which means that, to keep up with growing data volumes, it is necessary to continuously add machines and build infrastructure. Such an approach not only greatly increases cost but also hinders the efficiency of building models. In contrast, FL allows all the needed data to stay in place without building a dedicated data center to aggregate them, while every part of the data is still used to develop the model. This efficiency enables federated learning to be widely used in multiple areas, especially in the healthcare domain.

Federated learning in healthcare
The shift from written health records to electronic health records has been instrumental in driving the use of patient data to improve the healthcare industry. The adoption of electronic health records enables health care professionals to disseminate knowledge across all sectors of health care, which in turn helps to reduce medical errors and improve patient care and satisfaction. However, as mentioned previously, adequate medical datasets are difficult to obtain, yet exposing a model to diverse cases is critical for capturing the subtle relationships between disease patterns, socioeconomic and genetic factors, and complex or rare cases. FL is able to address this issue by enabling the distributed training of machine learning models using remotely hosted datasets, without the need to accumulate data and thereby compromise data privacy [2][3][4][5][6][7].
While FL is capable of making use of data across different sites/institutions, several data acquisition issues can still cause bias during model development. First, to protect data privacy, the Health Insurance Portability and Accountability Act (HIPAA) has established regulations for healthcare organizations to manage and safeguard personal information and to address their risks and legal responsibilities when processing personal patient data [8]. This leads to strict data sharing policies at each healthcare provider, which limits the number of available data sources. Another issue is the specialty gap between hospitals: healthcare providers may focus on the treatment of a few particular diseases instead of providing general hospitalization. In this case, there is a high chance that FL models trained across datasets with different disease focuses will make predictions with disease-specific bias. In addition, bias also arises when patient demographic characteristics differ. Income groups, age groups, genders, geographical locations, and living environments all affect the overall characteristics of the patients admitted to different regional hospitals, so data bias can also be observed in such dissimilarity. It is therefore essential to reduce all of the above biases when developing a federated learning model to make crucial medical predictions. We aim to design a novel federated learning model that takes this kind of bias into consideration at the first step, where node weight aggregation takes place.

Federated learning uniqueness and limitation
Although the use of traditional machine learning (TML) techniques in combination with electronic health records (EHR) is gaining popularity as a means to extract knowledge that can improve decision-making in healthcare, such techniques usually require training high-quality learning models on diverse and comprehensive datasets that are difficult to obtain due to the sensitivity of patients' medical data. Meanwhile, although distributed machine learning [9] has addressed parallel computing for handling large-scale data, these methods are often designed to tackle large data volumes using frequent data exchange. In addition, switching learning models is often prohibitively expensive or inconvenient, making it difficult for end users to try or implement different learning algorithms. In contrast, FL enables devices to collaboratively learn shared predictive models while keeping all training data on-device, decoupling the power of machine learning from the need to store data in the cloud. This goes beyond using native models to make predictions on mobile devices and also brings model training to the device. Table 1 summarizes the main differences between federated learning, traditional machine learning, and distributed machine learning. In summary, the inherent advantage of federated learning is that it allows flexible model training and continuous learning on end-user devices while ensuring that no end-user data leaves the device. Fig 1 shows how FL works. The global model M is downloaded from the central server to each client at the start of training, after which the downloaded model is trained by each client using its own dataset. Once local training is completed, each client uploads its updated training parameters to the central server; the central server aggregates the learned parameters (parameter aggregation) and passes the aggregated result to the global model. This completes one update of the global model, a process called a global update. Once the global update is finished, the model parameters are passed from the global model to each local model for a local update, where the clients' model parameters are replaced with the newly aggregated weights to start a new round of training [4].
Parameter aggregation is one of the most important steps of federated learning. Among all existing methods, Federated Averaging (FedAvg) is the most commonly used. Eq 1 summarizes the global weight update of FedAvg in each training round t, in which k is the client index, K is the total number of clients, n is the total number of instances, and n_k is the number of local data examples of client k [1]. Overall, Eq 1 states that the global weight vector w is the weighted average of the weight values obtained from the local clients. In this paper, a bold-faced symbol denotes a vector or a higher-dimensional array (e.g., a matrix).
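To make the aggregation concrete, the sketch below implements the weighted average of Eq 1 (w_{t+1} = sum_k (n_k / n) * w^k_{t+1}) with NumPy; the helper name fedavg_aggregate and the list-of-arrays representation of client weights are our own illustrative choices rather than part of the original implementation.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client weights, w = sum_k (n_k / n) * w_k (Eq 1)."""
    n = float(sum(client_sizes))                       # total number of instances n
    global_w = np.zeros_like(client_weights[0], dtype=float)
    for w_k, n_k in zip(client_weights, client_sizes):
        global_w += (n_k / n) * np.asarray(w_k, dtype=float)   # client k contributes n_k / n
    return global_w
```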
Table 1. Comparison between Federated Learning (FL), traditional machine learning (TML), and distributed machine learning (DML) algorithms. DML methods are commonly data driven (DML_d) or computing driven (DML_c). Data driven methods (DML_d) mainly try to learn from large volumes of distributed data, whereas computing driven methods (DML_c) aim to parallelize computing when learning from centralized data. "Computing framework" refers to the whole eco-system for learning, and "model switch" refers to the ease of switching to a new learning model.

Recently, other weight aggregation methods have also been proposed in FL. For example, an anomaly score of each client can be taken into consideration to detect abnormal client behavior, so that clients do not contribute equally when the global model updates its weight values; the majority of these novel methods are still based on FedAvg [10,11]. Even though FedAvg is widely used and has demonstrated good prediction performance [12,13], due to the nature of hidden layers in deep learning neural networks, we can clearly observe that this method manually forces weight aggregation between neurons located at exactly the same position (i.e., same layer and same node index) of two networks. However, when training two deep networks N_A and N_B with the same structure, even given the same input, neurons at the same position of the two networks do not always receive the same update. In other words, a certain property of the input (or the same instance) may trigger the most significant activation at the i-th node of N_A, while the same instance may trigger the most significant activation at the j-th node of N_B; that is, the same instance responds differently at same-position nodes of the two networks. In order to demonstrate the above hypothesis, we create a simple dense neural network N_D with one input layer, two hidden layers, and one output layer. A dataset with 10 features is fed into N_D. For the i-th node in the first hidden layer of N_D, there are 10 weight values {w^1_{i,0}, w^1_{i,1}, ..., w^1_{i,9}} corresponding to the 10 input features (the superscript denotes the first trained network). After training N_D from scratch five times with exactly the same dataset, a node e is randomly chosen from all five networks (with the same node index), from which we obtain 10 weight vectors {w_{e,0}, w_{e,1}, ..., w_{e,9}}, where w_{e,0} = [w^1_{i,0}, w^2_{i,0}, w^3_{i,0}, w^4_{i,0}, w^5_{i,0}] collects the trained weight values of the first-indexed node and the first feature dimension across all five networks, as shown in Fig 2. After that, we calculate the variance of w_{e,0} and repeat the same for all 10 nodes. Fig 3 reports the variance of the weight values across the five trained networks. The high weight variance in Fig 3 shows that weight aggregation by static node matching not only adds uncertainty to model performance but also hinders the practical application of federated learning in industry.
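As a rough illustration of this variance check, the following sketch computes, for a chosen hidden node index, the variance of its incoming weights across several independently trained networks. It assumes Keras-style first-layer kernels of shape (n_inputs, n_hidden); the function name is ours and the snippet is not the original experiment code.

```python
import numpy as np

def node_weight_variance(first_layer_kernels, node_index):
    """Variance of one node's incoming weights across independently trained networks.

    first_layer_kernels: list of arrays, each of shape (n_inputs, n_hidden),
                         e.g. the first hidden layer kernels of five networks
                         trained from scratch on the same data.
    node_index:          index of the hidden node to inspect (same index in each network).
    Returns one variance per input feature dimension; high values indicate that the
    node at this index plays a different role in each training run.
    """
    stacked = np.stack([k[:, node_index] for k in first_layer_kernels], axis=0)
    return stacked.var(axis=0)
```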

Method
In this paper, we design a dynamic node matching method, FedDNA, to aggregate weight values in each round based on neuron distances: after each client finishes training the model parameters with its own data, neuron distances are computed across all clients, and the closest neurons are matched so that their average weight values become the new parameters of the global model. Fig 3 also reports the weight variance of nodes matched dynamically, trained using the same setting as the static node matching. The results show that dynamic node matching yields much smaller weight variance across all nodes of the different networks. The advantage of reducing variance is that it allows nodes with similar behaviors to be aggregated for weight averaging, which potentially results in stable and improved federated learning performance. In summary, the main contributions of the proposed research are as follows:
• Dynamic node alignment: We propose a new dynamic node alignment framework, FedDNA, for weight aggregation in federated learning. Instead of using fixed node indices to match nodes across different sites, FedDNA finds the best matching nodes based on node weight values, such that nodes of the same layer with the most similar response to the input are merged into one node for the next round of training.
• Fast node alignment: To increase node alignment speed, we propose a Minimum Spanning Tree (MST) based method to efficiently find matching nodes across all sites.
• Alignment and frozen: In each training process, after the matching nodes are found at the very beginning, node matching is frozen and federated averaging is used for the remaining training rounds. By doing this, we ensure that the discovered node matching order is not disturbed by subsequent training.

Dynamic node alignment
Conventional FedAvg strictly averages local node weights in order to update the global model parameters, which has inspired researchers to come up with more flexible and reasonable node alignment ideas. A federated learning algorithm called Fed2, which incorporates a feature alignment strategy to enable local models to align feature representations with a global model by introducing a feature projection layer, has been proposed and validated [14]. In this method, the authors introduce a feature projection layer that maps the features extracted by the local models to a common feature space aligned with the global model. During training, the local models use this layer to project their features into the common feature space, where the distances between the projected features of the local models and the global model can be minimized. Another method aligns two models in federated learning by matching neurons that perform similar functions: the matching is done by projecting the local update onto the tangent space of the reference model, which ensures that the update is aligned with the reference model's geometry. The weights used in the averaging are proportional to the cosine similarity between each local update and the reference model. This weighting scheme gives more weight to updates that are better aligned with the reference model and less weight to updates that are less aligned, which helps to improve the overall convergence speed and accuracy of the algorithm [15]. Building on this approach, Multi-Center Federated Learning aims to improve personalization by clustering clients based on their data distribution. Its multi-center aggregation mechanism aggregates the local models from multiple centers to obtain a global model: each center trains its own local model based on the data of the clients in its cluster, and the local models are then aggregated by exchanging information between the centers to improve the global model [16].

Federated learning node aggregation
As more attention is being paid to data privacy protection, federated learning has become one of the most popular research areas, and a number of studies have proposed creative approaches related to weight updates in federated learning. LoAdaBoost FedAvg was proposed to achieve higher model prediction accuracy on distributed intensive care data, in which local models with a high cross-entropy loss are further optimized before model averaging on the server [17]. Federated-Autonomous Deep Learning (FADL) updates the global model by training part of the model using all data sources in a distributed manner while the rest of the model is trained with data from specific data sources [18]. For IID data, Haddadpour and colleagues introduced a framework called federated averaging with compression (FedCOM), in which the global model is determined not only by the average of all clients' training results but also by the previous global model [19]. A model poisoning attack was proposed in which an adversary controls a small number of malicious clients, highlighting the need to ensure that weight updates are not poisoned by such clients [20]. Similarly, abnormal clients are observed at the server side in a detection-based method that detects unusual client behavior in time to prevent abnormal server model updates, in which a low-dimensional surrogate of the weight vectors is created for anomaly detection [10]. A FL approach using mutual information (MI) reformulates weight updates by minimizing the MI between the local model and the aggregate model and adopting a negative correlation learning (NCL) strategy on the client side; the convergence of this algorithm is supported by experiments on MNIST, CIFAR-10, ImageNet, and the clinical MIMIC-III dataset [21]. A hierarchical federated edge learning framework was proposed to address the optimization of device scheduling and resources, in which intermediate hospitals are in charge of part of the model aggregation work [22]. To prevent unreliable updates from untrusted devices, the concept of reputation has been introduced in FL-enabled healthcare systems; reliable device selection plays an important role in mitigating multiple security attacks [23]. A new optimization algorithm for federated learning that leverages over-the-air computation aims to improve the convergence rate and accuracy of federated learning by introducing a learning-rate optimization algorithm that takes into account channel conditions and interference between devices in a federated learning network; this algorithm uses a feedback mechanism to adjust the learning rate dynamically [24]. A Federated Loss-Weighted Averaging (FedLWA) scheme is a key component of an unsupervised recurrent federated learning (URFL) algorithm that enhances parameter aggregation, in which each edge device calculates a weight based on the loss function of its local model during training [25]. The local-update SGD method, a technique that runs stochastic gradient descent (SGD) in parallel on various workers and synchronizes the sequences only occasionally, has been shown to achieve faster convergence and lower communication cost in federated learning [26]. Two further ideas based on SGD have been tested and proved to converge.
The first strategy uses local steps, and the second shares certain information while performing local computations; the results show higher convergence speed and more efficient communication in federated learning settings using these two SGD-based methods [27]. Several gaps between the upper and lower limits of optimization associated with federated learning have been highlighted, particularly for "intermittent communication graphs" that capture local SGD methods but whose convergence rates are not known to match the corresponding lower bounds [28]. The computational heterogeneity created by local-update SGD in federated learning has been analyzed, and one solution to the client fragmentation problem is to fix the number of local updates at a particular value while allowing clients to update the global model asynchronously or without locking [29]. Researchers have also assumed that all clients participate, in which case using batch gradient descent on all clients may be better than using stochastic gradients [30]. Li et al. [31] studied SGD convergence for federated averaging in a more realistic environment involving only a subset of clients per round; to ensure convergence, they assume that clients are selected uniformly at random or with probability proportional to the size of their local datasets. The problem of client characteristics varying over the course of the day has also been introduced, and the convergence of semi-cyclic SGD is studied in [32], in which blocks of clients with different characteristics are sampled following regular cyclic patterns, such as day and night. Due to heterogeneous computing power, clients can perform different numbers of local steps; periodic decentralized SGD (PD-SGD) has been proposed and proved to allow multiple local updates on top of decentralized SGD in federated learning [33]. Yu et al. recommend allowing each client to maintain a local momentum buffer and to average the local buffers together with the local model parameters at each communication round; this approach empirically improves the final accuracy of local SGD but doubles the communication cost per round [34]. A new technique, FedFast, accelerates distributed learning to achieve good accuracy for all users early in the training process by sampling from a different set of participating clients in each training round and applying an active aggregation approach to propagate the updated model to other clients; the authors demonstrate the effectiveness of their approach on various benchmark datasets and compare it with state-of-the-art recommendation techniques [35].

Federated learning for healthcare
Since FL is a general learning paradigm that eliminates data pooling requirements for AI model development, it has applications across multiple scenarios, especially across AI-driven healthcare [36]. An FL-based privacy-aware and resource-saving collaborative learning protocol was introduced in [37] for an EHR analysis management system working with multiple hospital institutions and cloud servers, where each hospital runs neural network models on its own EHR with the help of cloud computing. In addition, an FL-based approach was proposed to predict hospitalizations in patients diagnosed with heart disease using their historical EHR; more specifically, health data from an EHR system consisting of patient smartphones and distributed hospitals is trained locally on demographic information such as age, gender, and physical characteristics [6]. An FL scheme for wearable health monitoring was proposed in [38], in which smartphones and cloud servers cooperate to train and share a CNN model for privacy-conscious human activity recognition. A disease prediction method using FL with a large national health insurance dataset from 99 medical sites (such as hospitals and clinical laboratories) distributed across 34 states in the United States has also been studied [39]. The data include EHR for diabetes, psychological disorders, and ischemic heart disease; the FL approach achieves competitive performance in terms of accuracy and privacy compared with traditional methods such as centralized learning and local training without federation. Another work builds an FL-based health monitoring solution for analyzing patient outcomes across distributed hospital networks: each hospital creates an entity called a personalized treatment effect estimator, each estimator can be classified into subgroups where individual treatment outcomes include outcomes of patient characteristics, and site indicators are used to estimate overall treatment outcomes at coordination sites [40].
A federated brain imaging method has been proposed to utilize MRI scans distributed across multiple clinical centers and institutions; through collaboration between the medical sites and a central server, an FL model was derived to implement an end-to-end framework for data standardization, confounding factor correction, and high-dimensional feature variability measurement [41]. To facilitate X-ray scanning in intelligent health care, an FL-based approach has been proposed to support the diagnosis of acute neurological symptoms such as severe headache or loss of consciousness; each hospital runs a CNN-based DenseNet-121 model, which supports feature propagation, encourages feature reuse, and minimizes the number of neural parameters, to train on the X-ray image dataset provided by the Radiological Society of North America [42]. In addition, a dynamic fusion-based FL method has been proposed for CT scan image analysis, which can diagnose COVID-19 infection through client participation and client selection: each client, such as a medical institution, decides whether to participate in the FL round based on the performance of its newly trained model, and the central server calculates update times to determine which clients are allowed to upload their local gradients [43]. FL has also been combined with deep learning to build a collaborative learning solution for detecting COVID-19 lung anomalies in CT; the internal dataset was collected from 75 confirmed COVID-19 patients at three local hospitals in Hong Kong for the FL simulation and was then validated with data from mainland China and Germany [44].
Our work differs from the above methods because our dynamic node alignment finds the best matching nodes across different sites and aggregates the weights of the matched nodes for FL.

Motivation
Instead of using fixed node matching as FedAvg does, we propose dynamic node matching to find matching nodes between different sites, and then aggregate the weights of the matched nodes to calculate the weight values of the global model. During the FL process, the sites pass their local weight values to the center, and the center carries out node matching before aggregating the site weight values. Our idea is to use the weight values of each node as a feature vector to find matching nodes. Because the weight values of a neuron are associated with individual input features, nodes in the same hidden layer share the same input space. This allows us to use weight values to compute the distance/similarity between nodes for matching.
To make sure weight values are aggregated from the most similar nodes across all clients C, in the first step node distances are calculated across all clients, as shown in the distance matrix in Table 2; a minimum spanning tree based procedure is then used to ensure that the matching spans all clients. A minimum spanning tree (MST), or minimum weight spanning tree, is a subset of the edges of a connected edge-weighted undirected graph that joins all vertices together without any cycles and with the smallest possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph (not necessarily connected) has a minimum spanning forest, which is the union of the minimum spanning trees of its connected components [45]. For the example in Table 2, a distance mapping is plotted to demonstrate how the matching process works. First, node distances are calculated across all sites, in this case three sites. We start the matching process from node a because it has the smallest distance, 0.11, among all node pairs; node B has the smallest distance to it, so B is matched to a. In the next step, we use the MST to find the next matching node for {a, B}, which in this case is node α. This MST matching process continues until all nodes are matched across all clients, as shown in Fig 4.

Dynamic neural network node matching
In the proposed method, one key step is to find the closest nodes based on distance calculation in each round. This step is carried out at the center, and the aggregated weights are then dispatched to the federated learning sites for the next round. Node matching is applied to one specific hidden layer of all networks, one layer at a time; by default we refer to nodes in the first hidden layer for ease of explanation, and the same matching process applies to any other hidden layer. Algorithm 1 outlines the main steps of FedDNA for matching nodes across networks. The symbols used in our node matching are defined in Table 3.
v^s_i = [w^s_{i,0}, w^s_{i,1}, ..., w^s_{i,m}] is the weight vector of the i-th node in the first hidden layer. C = {c_1, c_2, ..., c_S} denotes the set of clients, and c_α = {v^α_1, v^α_2, ..., v^α_n} denotes the set of node weight vectors of client c_α. The weight vector of client c_α's i-th node is denoted by v^α_i = [w^α_{i,0}, w^α_{i,1}, ..., w^α_{i,m}]. Neuron matching distance calculation. Given two neurons v^α_i and v^β_j at the same layer, because they have the same input dimensions (in this paper we use a dense network architecture, so neurons at the same layer connect to all inputs/nodes of the preceding layer), we can represent each neuron as a vector and calculate the distance/similarity between neurons using these vectors. Assume that for a particular layer the input dimension is m, and the weight vectors of v^α_i and v^β_j are [w^α_{i,0}, ..., w^α_{i,m}] and [w^β_{j,0}, ..., w^β_{j,m}], respectively. The node distance between v^α_i and v^β_j can be calculated with the Euclidean distance defined in Eq 2 or with the Manhattan distance defined in Eq 3. The Euclidean distance between two points in Euclidean space is the length of the line segment between them, which represents the shortest distance between the two points. The Manhattan distance between two points in an m-dimensional vector space is the sum of the projected lengths of the line segment between the points onto the coordinate axes; in simple terms, it is the sum of the absolute differences between the two points over all dimensions.
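As a minimal sketch, the two distances of Eq 2 and Eq 3 over the incoming-weight vectors can be written as follows (the function names are ours, used for illustration only):

```python
import numpy as np

def euclidean_distance(v_a, v_b):
    """Eq 2: length of the line segment between two weight vectors."""
    diff = np.asarray(v_a, dtype=float) - np.asarray(v_b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def manhattan_distance(v_a, v_b):
    """Eq 3: sum of absolute coordinate-wise differences."""
    diff = np.asarray(v_a, dtype=float) - np.asarray(v_b, dtype=float)
    return float(np.sum(np.abs(diff)))
```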
During the node matching process, we grow a tree (i.e., a minimum spanning tree, MST) to link matched/aligned nodes across all sites. In this case, a tree T consists of a set of neurons, i.e., T = {v^α_i, ..., v^β_j, ...} with α ≠ β. We enforce α ≠ β so that an MST contains at most one node from each site (because we are trying to find matching nodes across sites, it does not make sense for a neuron to match a node of the same network). The number of nodes in the tree T varies as the tree grows dynamically; however, after matching, each node belongs to exactly one MST, and the final number of nodes in each MST equals the number of sites in the FL framework. We do not record the edges connecting nodes in the tree, because our goal is to find matching nodes as a group and then use their weights to update the center's node weights; the pairwise relationships between sites are not important to us. Each tree T records its member nodes and uses their weights to calculate the average weights, which are passed back to the respective members of tree T for the next round of FL learning.
During node matching, we need to expand the tree T by adding matching nodes to it. We therefore define the distance between a node v and a minimum spanning tree T as in Eq 4: the distance d(v, T) from a node to the tree equals its distance to the closest node in the tree, i.e., d(v, T) = min_{v' ∈ T} d(v, v').
Minimum spanning tree for neuron alignment across sites. In the first step, each client downloads the model from the central server and trains it with its local data, after which a client c_α is randomly chosen from C. One node v^α_i is randomly selected among all nodes in the first hidden layer of client c_α's local model. In the second step, another client c_k is chosen at random from {C - c_α}. The distance function explained previously is used to calculate the distance d(v^k_j, v^α_i) between each node in the first hidden layer of client c_k's model and node v^α_i, and the pair (v^k_j, v^α_i) with the smallest distance is matched. These two nodes are also the start of our MST T = {v^k_j, v^α_i}, from which we grow the tree. An MST is the tree whose cumulative edge weights are smallest; in our method, this means the tree whose cumulative node distances are smallest. In each matching step, we randomly pick one client from {C - {c_α, c_k}} and apply the node-to-tree distance of Eq 4 to find the next matching node to join the MST T. The MST T keeps growing until {C - {c_α, c_k, ...}} is empty, at which point a complete tree T with node set {v^α_i, v^k_j, v^β_t, ...} is formed, and its averaged weight values become a new node weight v^s_i for the global model. To illustrate the above description, suppose in Fig 5 one client is randomly chosen and nodes a and B are matched first; then, in Fig 6, the distances d(a, α), d(a, β), d(a, γ), d(a, θ), d(B, α), d(B, β), d(B, γ), and d(B, θ) are calculated, node α with the smallest distance is chosen, and nodes {a, B, α} become the matching nodes. Their weight values {v^1_a, v^2_B, v^3_α} are averaged and used as one new node value for the global model. Dynamic node alignment vs. frozen. In our proposed method, frozen means that instead of using dynamic node alignment throughout the entire training process, we train the federated learning model with dynamic node alignment for a certain number of rounds at the very beginning, and static node alignment is then applied for the rest of training. By doing so, nodes with similar responses are paired right after training starts; once all neurons are matched during these first rounds, we believe that the node pairing pattern is discovered and fixed to a certain extent, and using static node alignment thereafter prevents the pattern from being disturbed by subsequent training.
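The tree-growing matching procedure described above (before the freezing discussion) can be sketched as follows, under simplifying assumptions: per-site first-layer weights are given as arrays of shape (n_nodes, m), seed nodes and sites are visited in index order instead of being picked at random, and all names are ours. It is an illustrative sketch rather than the original implementation.

```python
import numpy as np

def match_nodes_across_sites(weights, dist):
    """Grow one MST-style group per node; each group holds exactly one node per site.

    weights: list of S arrays, each of shape (n_nodes, m) -- per-site first-layer weights.
    dist:    pairwise distance function, e.g. euclidean_distance.
    Returns a list of groups; each group is a list of (site, node_index) pairs.
    """
    S = len(weights)
    n_nodes = weights[0].shape[0]
    unmatched = [set(range(n_nodes)) for _ in range(S)]   # nodes still free at each site
    groups = []
    for i in range(n_nodes):
        # Seed the tree with node i of site 0 (the paper picks the seed site/node at random).
        tree = [(0, i)]
        unmatched[0].discard(i)
        for s in range(1, S):
            # Eq 4: distance from a candidate node to the tree is the distance
            # to its closest member node.
            best_j = min(unmatched[s],
                         key=lambda j: min(dist(weights[s][j], weights[t][k]) for t, k in tree))
            tree.append((s, best_j))
            unmatched[s].discard(best_j)
        groups.append(tree)
    return groups

def aggregate_groups(weights, groups):
    """Average the weight vectors of matched nodes to form the global layer (one row per group)."""
    return np.stack([np.mean([weights[s][j] for s, j in g], axis=0) for g in groups])
```

For instance, `aggregate_groups(weights, match_nodes_across_sites(weights, euclidean_distance))` would return one averaged weight vector per matched group for the global model's first hidden layer.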
Theoretical analysis. In this subsection, we analyze the time complexity of FedDNA and compare it with a naive globally optimal matching search. Denote by S the number of sites, by n the number of first-layer nodes at each site, and by m the number of features for each neuron. Because all sites in the FL setting have the same network structure, we focus only on the first layer; the same analysis applies to other layers if dynamic node alignment is carried out beyond the first layer.
Finding the globally optimal matching (i.e., minimizing the sum of matching distances of all nodes across all sites) is a combinatorial problem, because it requires comparing all nodes against all other nodes across all sites. For two sites, each having n nodes, the matching complexity is O(n · n · m), because all node pairs must be cross-checked and each pair involves m feature-dimension comparisons. Adding a third site results in O(n · n · n · m) complexity, because all node combinations across the three sites must be checked. As a result, for S sites the total complexity is O(n^S · m), which grows exponentially with the number of sites.
For FedDNA, finding the matching nodes across all sites for one node requires O((S - 1) · n · m) operations, because the node only needs to be compared with nodes from the other sites, not with nodes from its own site. Once the first node is matched across all sites and the matched nodes are added to the minimum spanning tree (MST), matching the next node requires O((S - 1) · (n - 1) · m) operations, because n - 1 unmatched nodes remain at each site. As a result, the total time complexity for all nodes across all sites is the sum of the individual nodes' complexities: O((S - 1) · n · m) + O((S - 1) · (n - 1) · m) + ... + O((S - 1) · 1 · m) = O(S · n^2 · m). By growing a minimum spanning tree (MST) to support the matching, FedDNA reduces the exponential complexity O(n^S · m) of globally optimal matching to the quadratic complexity O(S · n^2 · m).
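Written out (our restatement of the argument above, using the closed form of the arithmetic series), the sum of the per-node costs is

$$\sum_{r=1}^{n}(S-1)\,r\,m \;=\; (S-1)\,m\,\frac{n(n+1)}{2} \;=\; O(S\,n^{2}\,m).$$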
In summary, FedDNA's complexity is linear with respect to the number of sites, and quadratic with respect to the number of nodes at each site.

Datasets
We used four benchmark datasets in the experiments. The first one is the Diabetes Data Set, whose data come from two main sources, an automatic electronic recording device and paper records, and whose task is to predict whether a patient has diabetes or not. The automated electronic recording devices have an internal clock to time-stamp events, whereas paper records only provide "logical time" slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times are assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00); paper records therefore have a fictional uniform recording time, while electronic records have more realistic time stamps [46]. The second dataset is the Spambase Data Set from UCI, which was created from spam emails reported by a postmaster and individuals, and non-spam emails from work and personal e-mails, in order to construct a personalized spam filter. In this dataset, the last column indicates whether the email is considered spam (1) or not (0), i.e., unsolicited commercial email; most attributes indicate whether a particular word or character occurs frequently in the email [47]. Another dataset used in this paper is the Patient Survival Prediction Dataset. It uses knowledge about patients' chronic conditions in Intensive Care Units (ICUs) to inform clinical decisions about patient care and ultimately to predict patients' survival outcomes [48]. The Occupancy Detection Data Set is the last dataset used to verify our model's performance. It is a dataset for predicting room occupancy from environmental factors such as temperature, humidity, light and CO2, with ground-truth occupancy obtained from time-stamped pictures taken every minute [49].
Basic descriptions of the four datasets are shown in Table 4, from which we can see the number of samples in each dataset: the diabetes dataset has 1,150 samples, and there are 4,601 and 91,714 instances in the Spambase and Patient Survival Prediction datasets, respectively. The Patient Survival Prediction dataset has the most samples and the most attributes, while the Occupancy Detection dataset has a medium number of samples with the fewest attributes. The data dimensions of the four datasets also differ, with various feature types such as categorical and numerical features. One common point is that every dataset has only two classes, which means binary classification is performed in our proposed model.

Baseline methods
To validate the performance of the proposed method, we use deep neural networks as the training models and employ five baselines for comparison. The first is a plain neural network (Plain_NN) model with the same structure as our proposed model: one input layer, two hidden layers, and one output layer.
FedAvg. The second baseline is Federated Averaging (FedAvg), which shares the same network structure with our proposed method and uses static node matching to aggregate node weight values for the global model. In FedAvg, each client downloads the current model from a central server, improves it by learning from its own local data, and then summarizes the changes as a small focused update. Only updates to the model are sent to the server/cloud using encrypted communication and are immediately averaged with other clients' updates to improve the shared model according to Eq 1. All training data are kept locally and no individual updates are stored in the cloud. FedAvg is a generalization of FedSGD that allows local nodes to perform multiple batch updates on local data and exchange updated weights instead of gradients. The basic principle behind this generalization is that in FedSGD, if all local nodes start from the same initialization, averaging the gradients is strictly equivalent to averaging the weights themselves. Furthermore, averaging tuned weights from the same initialization does not necessarily harm the performance of the resulting averaged model [1,7].
FedDyn. The next baseline is FedDyn, in which each client learns a unique model with its own regularization parameter [50]. The regularization parameter is updated dynamically during the training process based on the client's local model performance. This means that clients with more difficult data can have higher regularization, while clients with easier data can have lower regularization, which improves the convergence speed and accuracy of the federated learning process.
The objective of FedDyn is to solve Eq 5, where each device k ∈ [m] holds N_k training instances, L_k(θ) is the empirical loss of the k-th device, and θ are the parameters of the neural network.
FedDNA_fixed. This baseline calculates node distances with respect to a fixed node. It is included because we want to examine whether the node matching pattern of dynamic node alignment improves over the case in which the node used for matching remains the same. In the first step, after each client finishes training its local model, a client c_α is randomly chosen from C, and one node v^α_i is randomly selected among all nodes in the first hidden layer of client c_α's local model. In the next step, one client c_k is randomly picked from {C - c_α}, and the distance function explained previously is used to calculate the distance d(v^k_j, v^α_i) between each node in the first hidden layer of client c_k's model and node v^α_i; the pair (v^k_j, v^α_i) with the smallest distance is matched. These two nodes are also the start of our MST T = {v^k_j, v^α_i}, from which we grow the tree. In each matching step, we randomly pick one client from {C - {c_α, c_k}}. Unlike FedDNA, in this baseline the distance between a node and the tree is calculated with Eq 6, meaning that only v^α_i in the MST T is used for node matching. The MST T keeps growing until {C - {c_α, c_k, ...}} is empty, at which point a complete tree T with node set {v^α_i, v^k_j, v^β_t, ...} is formed, and its averaged weight values become a new node weight v^s_i for the global model.
For example, in Fig 5, the distances for c_2 are d(a, A), d(a, B), d(a, C), d(a, D), and for c_3 they are d(a, α), d(a, β), d(a, γ), d(a, θ). Assume that for c_2 the smallest distance is d(a, B) and for c_3 it is d(a, α); then nodes {v^1_a, v^2_B, v^3_α} are the matching nodes and their weight values are averaged as one new node weight for the global model. FedDNA_random. The last baseline is a modification of both FedDNA and FedDNA_fixed. Instead of being fully static or fully dynamic in the node matching, we want to examine the case where the matching node is neither 100% fixed nor based on the entire MST. The settings for FedDNA_random are as follows. In the first step, after each client finishes training its local model, a client c_α is randomly chosen from C, and one node v^α_i is randomly selected among all nodes in the first hidden layer of client c_α's local model. In the next step, one client c_k is randomly picked from {C - c_α}, and the distance function explained previously is used to calculate the distance d(v^k_j, v^α_i) between each node in the first hidden layer of client c_k's model and node v^α_i; the pair (v^k_j, v^α_i) with the smallest distance is matched. These two nodes are also the start of our MST T = {v^k_j, v^α_i}, from which we grow the tree. In the third step, one node is randomly chosen from {v^α_i, v^k_j} and used to match nodes of the clients in {C - {c_α, c_k}} using Eq 4. Step 3 is repeated until {C - {c_α, c_k, ...}} is empty, at which point a new node set {v^α_i, v^k_j, ...} is formed and its averaged weight values become a new node weight for the global model. Assume we randomly choose c_2 in Fig 5 for the first match, so nodes {a, B} are the matching nodes; then one node is randomly chosen from {a, B} to calculate the distances for c_3. If node B is chosen, the distances d(B, α), d(B, β), d(B, γ), d(B, θ) are calculated to choose the next matching node.
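The three FedDNA variants differ only in which tree members a candidate node is compared against when computing the node-to-tree distance. A sketch of this difference, under the same illustrative assumptions and names as before (not the original implementation), is:

```python
import random

def node_to_tree_distance(v, tree, weights, dist, variant="full"):
    """Distance from candidate weight vector v to the current tree T.

    variant="full"   -- FedDNA (Eq 4): distance to the closest member of the tree.
    variant="fixed"  -- FedDNA_fixed (Eq 6): distance to the initially chosen seed node only.
    variant="random" -- FedDNA_random: distance to one randomly chosen tree member.
    tree is a list of (site, node_index) pairs; weights[s][j] is the weight vector of node j at site s.
    """
    members = [weights[s][j] for s, j in tree]
    if variant == "full":
        return min(dist(v, u) for u in members)
    if variant == "fixed":
        return dist(v, members[0])            # members[0] is the seed node v_alpha_i
    if variant == "random":
        return dist(v, random.choice(members))
    raise ValueError("unknown variant: " + variant)
```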

Experiment settings
Our overall experiment setting is shown in Table 5.

Table 5. Pseudo code of the experiment settings and comparisons. All methods are compared on the same training/test data, and the initial network weights of each site are the same for all methods to avoid discrepancy due to random weight initialization.

We use 10-fold cross validation, under which there are 10 training rounds for each model. For each dataset, our aim is to predict the corresponding target, and 10-fold cross validation is applied to reduce both bias and variance. Under each cross-validation fold, the same initial weight values are used for all baseline models and our proposed models: Plain_NN, FedAvg, FedDyn, FedDNA_fixed, FedDNA_random, and FedDNA. For methods under the FL setting, model parameters are passed to each client at the very beginning of training. The training data are randomly split into 5 partitions and distributed to 5 clients, each of which trains its local model using its own data; weight values are then aggregated according to the respective FL method and sent back to the global model. The global model passes the newly calculated parameters back to the local clients to start a new round of training, until convergence. For our proposed method FedDNA, there are two experiment settings in this paper. The first is the no-freezing weight update setting, in which the weight values of the global model are aggregated using the FedDNA method for all 10 rounds. The second setting is called freezing: we update the global model parameters with FedDNA for the first i rounds, after which FedAvg is used to aggregate the clients' model weight values for the remaining rounds. We design this setting because we expect that the first several rounds of distance calculation will reveal the closest matching nodes, after which that matching can be used directly to aggregate the node weight values.
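A minimal sketch of the freezing schedule described above, assuming hypothetical helpers local_train, feddna_aggregate (the matching and averaging sketched earlier), and fedavg_aggregate; all names and the interface are ours for illustration.

```python
def federated_training(global_weights, clients, local_train, feddna_aggregate,
                       fedavg_aggregate, total_rounds=10, freeze_after=2):
    """Freezing setting: dynamic node alignment (FedDNA) for the first `freeze_after`
    rounds, then plain FedAvg aggregation for the remaining rounds."""
    for t in range(total_rounds):
        # Each client trains the current global model on its own local data.
        local_weights = [local_train(global_weights, c) for c in clients]
        sizes = [len(c) for c in clients]
        if t < freeze_after:
            # Dynamic node alignment: match the closest nodes across sites, then average them.
            global_weights = feddna_aggregate(local_weights)
        else:
            # Node matching is frozen; fall back to static FedAvg aggregation (Eq 1).
            global_weights = fedavg_aggregate(local_weights, sizes)
    return global_weights
```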
For the dataset settings, we first run our model with the original class distributions across all clients in all datasets, which are negative:positive = 1.54 for the Spambase dataset, 0.89 for the diabetes dataset, 11.26 for the Patient Survival Prediction dataset, and 3.33 for the Occupancy Detection dataset. In the second experiment setting, for each training process, 2 clients are randomly chosen to exchange 2/3 of their data while the remaining 3 clients keep their own data; in this case, our model is evaluated on non-IID data. The calculated overall node distance, accuracy, AUC, F1-score, balanced accuracy, and loss are used as performance metrics.
Apart from randomly selecting 2 clients to exchange their data, we also evaluate our proposed model under different class distribution settings. The original class distributions (negative:positive) of the four datasets are shown in Table 4. A set of class distributions is set up for the four datasets to check model performance; since the datasets have different original class distributions, the assigned class distributions differ across datasets. Tables 6-9 show the results for the Diabetes, Spam, Occupancy, and Patient Survival datasets, respectively, in our first dataset setting. Due to page limitations, only the best model performance results are presented in this paper. For the Diabetes dataset, FedDNA is able to find node combinations whose total node distance is the smallest, 42.1352, compared with the other methods whose final distances exceed 50, and at the same time FedDNA presents better metric performance. Similarly, the smallest overall node distance and better metric performance are achieved by FedDNA on the Spam dataset. However, we can also conclude that a smaller overall node distance and better metric performance do not always come with the smallest training loss, especially for FedDNA. For the Occupancy and Patient Survival Prediction datasets, FedDNA behaves similarly to the previous two datasets: its overall classification performance outperforms all the baselines with the smallest node distances after matching, 5.7316 and 57.4096 respectively, which indicates that FedDNA is able to pair the closest nodes together. Overall, for all four datasets, when class distributions are the same across all clients, our proposed method performs best in the freezing setting. Results for the non-IID setting are shown in Fig 7, in which outliers can be observed for three models, but overall we can conclude that when data are not evenly distributed across all clients, FedDNA performs best in the freezing setting, with the first two rounds using the Manhattan distance to find the matching nodes and the remaining rounds using FedAvg. Since under this experiment setting FedDyn does not deliver better overall performance than FedAvg according to the above tables and figure, its detailed comparison with FedDNA is not shown. Figs 8-11 report the performance of FedDNA and FedAvg with respect to different class distributions (the class distributions were adjusted to assess algorithm performance under different conditions). The y-axis shows the value of each measurement and the x-axis shows the different class distribution setups for each dataset, as listed in Table 4.

Overall model performance
For the Diabetes dataset, FedDNA and FedAvg have the largest gap for all measurements when the sampling rate is 0.5, and both models' performance fluctuates considerably as the class distribution changes. For the Spam dataset, even though FedDNA and FedAvg perform similarly when the class ratio is small, as more negative samples appear in the dataset FedDNA starts to show a clearer advantage over FedAvg; especially when negative instances make up more than 40% of the dataset, the gap between the two models becomes larger, in favor of FedDNA. Similarly, for the Patient Survival Prediction dataset, the larger the sampling rate, the more FedDNA outperforms FedAvg, especially in terms of F-score and balanced accuracy. For the Occupancy dataset, FedDNA does not show much better results than FedAvg when the class ratio is less than 2, after which the performance of the two models tends to be consistent.

Discussion
In this paper, we propose a dynamic node matching method for federated learning. We argued that neural networks are inherently non-transparent and unstable: the same network structure may end up with very different weight values, even with the same training data and the same parameter settings. Existing methods, such as FedAvg, force neurons across sites to be matched in a predefined order and use fixed node matching during the FL learning process. In contrast, we proposed a dynamic node alignment approach, FedDNA, which dynamically finds matching nodes across sites and uses the matched nodes to calculate weights for FL learning. FedDNA represents each neuron as a vector of its weight values and calculates distances between neurons to find matching nodes. Because finding matching nodes is computationally expensive, we proposed a minimum spanning tree (MST) based approach to speed up the matching, with matched nodes across all sites linked by an MST; the matching process is then simply the MST growing process. Experiments and comparisons, including biased sample distributions, validate the performance of FedDNA compared to the other baselines.
Future work can emphasize the following three directions. First, we only studied dense networks and verified FedDNA's performance on them; future work can explore node matching between other types of network architectures, such as convolutional neural networks. Second, we only studied the proposed design on binary classification problems; in the future, multi-class classification problems will be explored using FedDNA. Third, we will use non-IID datasets to further adjust our model so that it can be applied not only to binary classification but can also achieve good results for datasets with different settings.